Chapter 14: Reading and writing text files

In this chapter you will learn how to read data from files, do some analysis and write the results to disk.

Reading and writing files is an essential part of programming, as it is the first step for your program to communicate with the outside world. In most cases you will write programs that take data from some source, manipulate it in some way and write some results out somewhere.

For example, if you ran a survey, you could take input from participants on a web server and save their answers in some files or in a database. When the survey is over, you would read these results in and do some analysis on the data you have collected, maybe do some visualizations and save your results.

In NLP, you often process files containing raw texts with some code and write the results to some other file.

At the end of this chapter, you will be able to:

  • open one or multiple text files
  • work with the modules os and glob
  • read the contents of a file
  • write new or manipulated content to new (or existing) files
  • close a file

Acknowledgements:

We use some materials from this other Python course.

If you have any questions about this chapter, please refer to the forum on Canvas.

1. Reading a file

In Python, you can read the content of a file, store it as the type of object that you need (string, list, etc.) and manipulate it (e.g. replacing or removing words). You can also write new content to an existing or a new file.

Here, we will discuss how to:

  • open a file
  • read in the content
  • store the content in a variable (to do something with it), e.g. as a string or list
  • close the file

1.1. File paths

To open a file, we need to associate the file on disk with a variable in Python. First, we tell Python where the file is stored on your disk. The location of your file is often referred to as the file path.

Python will start looking in the 'working' or 'current' directory (which often will be where your Python script is). If it's in the working directory, you only have to tell Python the name of the file (e.g. charlie.txt). If it's not in the working directory, as in our case, you have to tell Python the exact path to your file. We will create a string variable to store this information:


In [ ]:
filename = "../Data/Charlie/charlie.txt"  

# The double dots mean 'go up one level in the directory tree'.

Sometimes you see double dots in the beginning of the file path; this means 'the parent of the current directory'. When writing a file path, you can use the following:

  • / means the root of the current drive;
  • ./ means the current directory;
  • ../ means the parent of the current directory.

Consider the directory tree below.

  • If you want to go from your current working directory (cwd) to the one directly above (dir3), your path is ../.
  • If you want to go to dir1, your path is ../../
  • If you want to go to dir5, your path is ../dir5/
  • If you want to go to dir2, your path is ../../dir2/
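
As a rough check, the relative paths in this list can be resolved in code. The sketch below assumes a hypothetical layout in which dir1 sits directly under the root (/), dir1 contains dir2 and dir3, and dir3 contains dir5 and a working directory called cwd. The absolute locations and the helper function resolve() are made up for illustration. It uses posixpath (the UNIX flavour of os.path) so that the result looks the same on every operating system.

```python
import posixpath

# Hypothetical absolute location of the current working directory.
cwd = "/dir1/dir3/cwd"

def resolve(relative_path):
    """Join a relative path onto cwd and collapse any '..' parts."""
    return posixpath.normpath(posixpath.join(cwd, relative_path))

to_dir3 = resolve("../")
to_dir1 = resolve("../../")
to_dir5 = resolve("../dir5/")
to_dir2 = resolve("../../dir2/")

for path in (to_dir3, to_dir1, to_dir5, to_dir2):
    print(path)
```

Each '..' removes one directory from the end of the path, which is exactly the 'go up one level' step described above.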

You will learn how to navigate your directory tree quite intuitively with a bit of practice. If you have any doubts, it is always a good idea to follow a quick tutorial on basic command line operations.

Navigating your directory tree on Windows

Also note that the formatting of file paths is different across operating systems. The file path as specified above should work on any UNIX platform (Linux, Mac). If you are using Windows, however, you might run into problems when formatting file paths in this way outside of this notebook, because Windows uses backslashes instead of forward slashes (Jupyter Notebook should already have taken care of these problems for you). In that case, it might be useful to have a look at this page about the differences between the file systems, and at this page about solving this problem in Python. In short, it's probably best if you use the code below (we will talk about the os module in more detail later today). This is very useful to know if you are a Windows user, and it will become relevant for the final assignment.


In [ ]:
# For windows: 
import os
windows_file_path = os.path.normpath("C:/somePath/someFilename") # Use forward slashes

1.2 Opening a file

We can use the file path to tell Python which file to open by using the built-in function open(). The open() function does not return the actual text that is saved in the text file. It returns a 'file object' from which we can read the content using the .read() method (more on this later). We can pass three arguments to the open() function:

  • the path to the file that you wish to open
  • the mode, a combination of characters explaining the purpose of the file opening (like read or write) and type of content stored in the file (like textual or binary format). For instance, if we are reading a plain text file, we can use the characters 'r' (represents read-mode) and 't' (represents plain text-mode).
  • the last argument, a keyword argument (encoding), specifies the encoding of the text file. The encoding matters when a file contains characters beyond plain English letters (for example accented letters), but you can forget about this for now.
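
A minimal sketch of a call with all three arguments is shown here. It assumes a throwaway file (the name example.txt and its text are made up for illustration) that is first created in a temporary directory, so it does not touch the course data.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
# The file name and its text are hypothetical.
tmp_dir = tempfile.mkdtemp()
example_path = os.path.join(tmp_dir, "example.txt")
outfile = open(example_path, "wt", encoding="utf-8")
outfile.write("café, naïve, Zoë\n")
outfile.close()

# The three arguments: the path, the mode ('r' = read, 't' = text)
# and the encoding keyword argument.
infile = open(example_path, "rt", encoding="utf-8")
text = infile.read()
infile.close()

print(text)
```

Using the same encoding for writing and reading is what makes the accented characters come back unchanged.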

The most important mode arguments the open() function can take are:

  • r = Opens a file for reading only. The file pointer is placed at the beginning of the file.
  • w = Opens a file for writing only. Overwrites the file if the file exists. If the file does not exist, creates a new file for writing.
  • a = Opens a file for appending. The file pointer is at the end of the file if the file exists. If the file does not exist, it creates a new file for writing. Use it if you would like to add something to the end of a file.

Then, to open the file 'charlie.txt' for reading purposes, we use the following:


In [ ]:
filepath = "../Data/Charlie/charlie.txt"  
infile = open(filepath, "r") # 'r' stands for READ mode
# Do something with the file
infile.close() # Close the file (you can ignore this for now)

Overview of possible mode arguments (the most important ones are 'r', 'w' and 'a'):

Character Meaning
'r' open for reading (default)
'w' open for writing, truncating the file first
'x' open for exclusive creation, failing if the file already exists
'a' open for writing, appending to the end of the file if it exists
'b' binary mode
't' text mode (default)
'+' open a disk file for updating (reading and writing)
'U' universal newlines mode (deprecated; removed in Python 3.11)

We could also use the path directly in open():


In [ ]:
infile = open("../Data/Charlie/charlie.txt" , "r")
infile.close()

So far, we have opened the file. This, however, does not yet show us the file content. Try printing 'infile':


In [ ]:
infile = open("../Data/Charlie/charlie.txt" , "r")
print(infile)
infile.close()

This TextIOWrapper thing is Python's way of saying it has opened a connection to the file charlie.txt. To actually see its content, we need to tell Python to read the file.
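
The file object also carries some information about the connection itself. The sketch below uses an automatically named temporary file (not part of the course data) and looks at two attributes of the object returned by open(): mode and closed. The variable names are illustrative.

```python
import os
import tempfile

# Make an empty throwaway file to open.
handle, tmp_path = tempfile.mkstemp(suffix=".txt")
os.close(handle)

infile = open(tmp_path, "r")
opened_mode = infile.mode       # the mode the file was opened with
closed_before = infile.closed   # False while the connection is open
infile.close()
closed_after = infile.closed    # True once close() has been called

print(opened_mode, closed_before, closed_after)
os.remove(tmp_path)
```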

1.3 Reading a file

Here, we will discuss three ways of reading the contents of a file:

  • read()
  • readlines()
  • readline()

1.3.1 read()

The read() method is used to access the entire text in a file, which we can assign to a variable. Consider the code below.

The variable content now holds the entire content of the file charlie.txt as a single string and we can access and manipulate it just like any other string. When we are done with accessing the file, we use the close() method to close the file.


In [ ]:
# Opening the file using the filepath and the 'read' mode:
infile = open("../Data/Charlie/charlie.txt" , "r")

# Reading the file using the `read()` method and assigning it to the variable `content`
content = infile.read()
print(content)
print()
print('This function returns a', type(content))

# closing the file (more on this below)
infile.close()

1.3.2 readlines()

The readlines() method allows you to access the content of a file as a list of lines. This means that it splits the text in a file at the newline characters ('\n') for you. Note that each line in the list keeps its newline character at the end:


In [ ]:
# Opening the file using the filepath and the 'read' mode:
infile = open("../Data/Charlie/charlie.txt" , "r")

# Reading the file using the `readlines()` method and assigning it to the variable `lines`
lines = infile.readlines()
print(lines)
print()
print('This function returns a', type(lines))

# closing the file 
infile.close()

Now you can, for example, use a for-loop to print each line in the file (note that the second line is just a newline character):


In [ ]:
for line in lines:
    print("LINE:", line)
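
Each string in the list still ends with its newline character, and print() adds one more, which is why blank lines appear in the output. One possible way to avoid this is to strip the trailing newline before printing. The sketch below uses a made-up list as a stand-in for the result of readlines().

```python
# A hypothetical stand-in for the list returned by readlines().
lines = ["First line\n", "\n", "Third line\n"]

clean_lines = []
for line in lines:
    # rstrip("\n") removes the newline at the end of the line, if present.
    clean_lines.append(line.rstrip("\n"))

for line in clean_lines:
    print("LINE:", line)
```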

Important note

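When we open a file, a read operation that consumes the whole file (such as read() or readlines()) can only be used once: afterwards the file pointer is at the end of the file and there is nothing left to read. If we want to read the content again, we have to open the file again. Consider the code below: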


In [ ]:
infile = open("../Data/Charlie/charlie.txt" , "r")
content = infile.read()
lines = infile.readlines()
print(content)
print(lines)
infile.close()

The second print statement shows an empty list, because read() has already consumed the whole file. To fix this, we have to open the file again:


In [ ]:
filepath = "../Data/Charlie/charlie.txt"

infile = open(filepath, "r")
content = infile.read()
infile.close() # close the first file object before opening a new one

infile = open(filepath, "r")
lines = infile.readlines()
infile.close()

print(content)
print(lines)
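
Reopening is not the only option. A file object keeps track of its position in the file, and the seek() method can move that position back to the start. The sketch below demonstrates this on a temporary file with made-up content rather than on charlie.txt.

```python
import os
import tempfile

# Throwaway file with hypothetical content.
tmp_dir = tempfile.mkdtemp()
tmp_path = os.path.join(tmp_dir, "two_lines.txt")
outfile = open(tmp_path, "w")
outfile.write("line one\nline two\n")
outfile.close()

infile = open(tmp_path, "r")
content = infile.read()         # the position is now at the end of the file
leftover = infile.readlines()   # nothing left to read: an empty list
infile.seek(0)                  # move the position back to the beginning
lines = infile.readlines()      # reads everything again
infile.close()

print(leftover)
print(lines)
```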

1.3.3 readline()

The third operation, readline(), returns the next line of the file: the text up to and including the next newline character (\n; in text mode, Python translates Windows \r\n line endings to \n). More simply put, this operation reads a file line by line, so if you call it again, it returns the next line in the file. Try it out below!


In [ ]:
filepath = "../Data/Charlie/charlie.txt"

infile = open(filepath, "r")
next_line = infile.readline()
print(next_line)

In [ ]:
next_line = infile.readline()
print(next_line)

In [ ]:
next_line = infile.readline()
print(next_line)
infile.close()

Which function to choose

For small files that you want to load entirely, you can use one of these three methods (readline, read, or readlines). Note, however, that we can also simply do the following to read a file line by line (this is recommended for larger files and when we are really only interested in a small portion of the file):


In [ ]:
infile = open(filename, "r")
for line in infile:
    print(line)
infile.close()

Note the last line of this code snippet: infile.close(). This closes our file, which is a very important operation: it prevents Python from keeping files open that are no longer needed. In the next subchapter we will also see a more convenient way to ensure that files get closed after use.

1.4. Closing the file

Here, we will introduce closing a file with the close() method and using a context manager to open and close files. After reading the contents of a file, the TextIOWrapper no longer needs to be open since we have stored the content in a variable. In fact, it is good practice to close the file as soon as you do not need it anymore.

1.4.1 close()

We do this by using the close() method as already shown several times above.


In [ ]:
filepath = "../Data/Charlie/charlie.txt"

# open file
infile = open(filepath , "r")

# assign content to a variable
content = infile.read()

# close file
infile.close()


# do whatever you want with the content, e.g. print it:

print(content)

1.4.2 Using a context manager

There is actually an easier (and preferred) way to make sure that the file is closed as soon as you don't need it anymore, namely using what is called a context manager. Instead of using open() and close(), we use the syntax shown below.

The main advantage of using the with-statement is that it automatically closes the file once you leave the local context defined by the indentation level. If you 'manually' open and close the file, you risk forgetting to close the file. Therefore, context managers are considered a best-practice, and we will use the with-statement in all of our following code.


In [ ]:
filepath = "../Data/Charlie/charlie.txt"

with open(filepath, "r") as infile:
    content = infile.read()
    
print(content)
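
To see that the with-statement really closes the file, you can check the closed attribute of the file object after the indented block. The sketch below uses a temporary file with made-up content; the names tmp_path and was_open are illustrative.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
tmp_path = os.path.join(tmp_dir, "example.txt")

# Set-up: create a small file to read from.
with open(tmp_path, "w") as outfile:
    outfile.write("some text\n")

with open(tmp_path, "r") as infile:
    was_open = not infile.closed   # True: inside the block the file is open
    content = infile.read()

# Outside the block the file has been closed for us,
# but the variable content is still available.
print(was_open, infile.closed)
print(content)
```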

2 Manipulating file content

Once your file content is loaded in a Python variable, you can manipulate its content as you can manipulate any other variable. You can edit it, add/remove lines, count word occurrences, etc. Let's say we read the file content in a list of its lines as shown below. Note that we can use all of the different methods for reading files in the context manager.


In [ ]:
filepath = "../Data/Charlie/charlie.txt"

with open(filepath, "r") as infile:
    lines = infile.readlines()
    
print(lines)

Then we can for instance preserve only the first 2 lines of the file, in a new variable:


In [ ]:
first_two_lines = lines[:2]
first_two_lines

We can count the lines that are longer than 15 characters:


In [ ]:
counter = 0
for line in lines:
    if len(line) > 15:
        counter += 1
print(counter)
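
Keep in mind that each line read with readlines() still includes its newline character, so len(line) counts it as well. If you want to count only the visible characters, one option is to strip the newline first. The sketch below uses a made-up list of lines instead of the Charlie text.

```python
# Hypothetical lines, as readlines() would return them.
lines = ["exactly fifteen\n", "short\n", "a line that is clearly longer\n"]

# Counting with the newline character included.
with_newline = sum(1 for line in lines if len(line) > 15)

# Counting after removing the trailing newline.
without_newline = sum(1 for line in lines if len(line.rstrip("\n")) > 15)

print(with_newline, without_newline)
```

The first line has 15 visible characters, so it only passes the test when its newline character is counted too.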

In the next chapter, we will see how to do more advanced text processing on a loaded file by using an external module. But let's first write our modified file back to disk to preserve the changes.

3 Writing files

To write content to a file, we can open a new file and write the text to this file by using the write() method. Again, we can do this by using the context manager. Remember that we have to specify the mode using w.

Let's first slightly adapt our Charlie story by replacing the names in the text:


In [ ]:
filepath = "../Data/Charlie/charlie.txt"

# read in file and assign content to the variable content
with open(filepath, "r") as infile:
    content = infile.read()
    
# manipulate content

your_name = "x y" #type in your name 
friends_name = "a b" #type in the name of a friend 

# Replace all instances of Charlie Bucket with your name and save it in new_content
new_content = content.replace("Charlie Bucket", your_name)

# Replace all instances of Mr Wonka with your friend's name and save it in new_new_content
new_new_content = new_content.replace("Mr Wonka", friends_name)

We can now save the manipulated content to a new file:


In [ ]:
filename = "../Data/Charlie/charlie_new.txt"
with open(filename, "w") as outfile:
    outfile.write(new_new_content)

Open the file charlie_new.txt in the folder ../Data/Charlie in any text editor and read a personalized version of the story!
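
The write() method expects a single string and does not add newline characters by itself. If your content is a list of lines, as returned by readlines(), one possibility is to write the lines one at a time. The sketch below writes to a temporary directory; the file name story.txt and the lines are made up.

```python
import os
import tempfile

# Hypothetical lines; each one already ends with a newline character.
lines = ["Once upon a time\n", "there was a chocolate factory.\n"]

tmp_dir = tempfile.mkdtemp()
out_path = os.path.join(tmp_dir, "story.txt")

with open(out_path, "w") as outfile:
    for line in lines:
        outfile.write(line)

# Read the file back to check what was written.
with open(out_path, "r") as infile:
    written = infile.read()

print(written)
```

If the strings in your list do not end with a newline character, you have to add "\n" yourself, otherwise all lines end up glued together.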

Note about append mode (a):

The third mode of opening a file is append ('a'). If the file 'charlie_new.txt' does not exist, then append and write act the same: they create this new file and fill it with content. The difference between write and append shows when the file already exists. In that case, the write mode overwrites its content, while the append mode adds the new content at the end of the existing content.
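
The difference can be tried out safely on a throwaway file. In the sketch below, the file name notes.txt and its content are made up, and the file lives in a temporary directory so that nothing in ../Data is changed.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
notes_path = os.path.join(tmp_dir, "notes.txt")  # hypothetical file name

# 'w' creates the file (or empties it if it already exists).
with open(notes_path, "w") as outfile:
    outfile.write("first\n")

# 'a' keeps the existing content and adds to the end.
with open(notes_path, "a") as outfile:
    outfile.write("second\n")

with open(notes_path, "r") as infile:
    after_append = infile.read()

# Opening with 'w' again throws the old content away.
with open(notes_path, "w") as outfile:
    outfile.write("third\n")

with open(notes_path, "r") as infile:
    after_write = infile.read()

print(after_append)
print(after_write)
```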

4 Reading and writing multiple files

You will often have multiple files to work with. The folder ../Data/Dreams contains 10 text files describing dreams of Vickie, a 10-year-old girl. These texts are extracted from DreamBank.

To process multiple files, we often want to iterate over a list of files. These files are usually stored in one or multiple directories on your computer.

Instead of writing out every single file path, it is much more convenient to iterate over all the files in the directory ../Data/Dreams. So we need to find a way to tell Python: "I want to do something with all these files at this location!"

There are two modules which make dealing with multiple files a lot easier.

  • glob
  • os

We will introduce them below.

4.1 The glob module

The glob module is very useful to find all the pathnames matching a specified pattern according to the rules used by the Unix shell. You can use two wildcards: the asterisk (*) and the question mark (?). An asterisk matches zero or more characters in a segment of a name, while the question mark matches a single character in a segment of a name.

For example, the following code gives all filenames in the directory ../Data/Dreams:


In [ ]:
import glob

In [ ]:
for filename in glob.glob("../Data/Dreams/*"):
    print(filename)

If we only want to consider text files and ignore everything else (here a file called 'IGNORE_ME!'), we can specify this in our search by only looking for files with the extension .txt:


In [ ]:
for filename in glob.glob("../Data/Dreams/*.txt"):
    print(filename)

A question mark (?) matches any single character in that position in the name. For example, the following code prints all filenames in the directory ../Data/Dreams that start with 'vickie' followed by exactly 1 character and end with the extension .txt (note that this will not print vickie10.txt):


In [ ]:
for filename in glob.glob("../Data/Dreams/vickie?.txt"):
    print(filename)

You can also find filenames recursively by using the pattern ** (the keyword argument recursive should be set to True), which will match any files and zero or more directories and subdirectories. The following code prints all files with the extension .txt in the directory ../Data/ and in all its subdirectories:


In [ ]:
for filename in glob.glob("../Data/**/*.txt", recursive=True):
    print(filename)
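
Once glob has given you the paths, you can open each file inside the loop. The sketch below is one way to do this, under some assumptions: it builds a small temporary folder with three made-up text files (dream1.txt to dream3.txt) instead of using ../Data/Dreams, and it counts the words in each file as an example of 'doing something' with the content. It also uses os.path.join() and os.path.basename() from the os module to build paths and to keep only the file name.

```python
import glob
import os
import tempfile

# Build a small hypothetical folder with three text files.
tmp_dir = tempfile.mkdtemp()
for number in (1, 2, 3):
    path = os.path.join(tmp_dir, "dream" + str(number) + ".txt")
    with open(path, "w") as outfile:
        outfile.write("word " * number)

# Collect the matching paths; sorted() gives a predictable order.
pattern = os.path.join(tmp_dir, "*.txt")
word_counts = {}
for filename in sorted(glob.glob(pattern)):
    with open(filename, "r") as infile:
        content = infile.read()
    # Store the number of words under the bare file name.
    word_counts[os.path.basename(filename)] = len(content.split())

print(word_counts)
```

The call to sorted() is optional, but glob.glob() makes no promise about the order of its results, so sorting keeps the output the same between runs.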

4.2 The os module

Another module that you will frequently see being used in examples is the os module. The os module has many features that can be very useful and which are not supported by the glob module. We will not go over each and every useful method here, but here's a list of some of the things that you can do (some of which we have seen above):

  • creating a single directory or nested directories: os.mkdir(), os.makedirs();
  • removing a single directory or nested empty directories: os.rmdir(), os.removedirs();
  • checking whether something is a file or a directory: os.path.isfile(), os.path.isdir();
  • split a path and return a tuple containing the directory and filename: os.path.split();
  • construct a pathname out of one or more partial pathnames: os.path.join();
  • split a filename and return a tuple containing the filename and the file extension: os.path.splitext();
  • get only the basename or the directory path: os.path.basename(), os.path.dirname().

Feel free to play around with these methods and figure out how they work yourself :-)


In [ ]:
# Start by importing the module:
import os

# let's use a filepath for testing it out:
filepath = "../Data/Charlie/charlie.txt"
os.path.basename(filepath)
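
A few more of the functions from the list can be combined as in the sketch below. The new file name charlie_new.txt follows the example from section 3, while the directory names results and chapter14 are made up and are created inside a temporary folder.

```python
import os
import tempfile

filepath = "../Data/Charlie/charlie.txt"

# Split the path into its directory and file name, and the
# file name into its name and extension.
directory, filename = os.path.split(filepath)
name, extension = os.path.splitext(filename)

# Build a new path in the same directory.
new_path = os.path.join(directory, name + "_new" + extension)

print(directory)
print(filename)
print(name, extension)
print(new_path)

# Create nested directories inside a temporary folder (hypothetical names).
tmp_dir = tempfile.mkdtemp()
nested = os.path.join(tmp_dir, "results", "chapter14")
os.makedirs(nested, exist_ok=True)
print(os.path.isdir(nested), os.path.isfile(nested))
```

Building paths with os.path.join() instead of typing the slashes yourself lets Python pick the right separator for your operating system.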

Exercises

Exercise 1:

Write a program that opens RedCircle.txt in the ../Data/RedCircle folder and prints its content as a single string:


In [ ]:
# your code here

Exercise 2:

Write a program that opens RedCircle.txt in the ../Data/RedCircle folder and prints a list containing all lines in the file:


In [ ]:
# your code here

Exercise 3:

Create a counter dictionary like in block 2 (the dictionaries chapter), in which you count the number of occurrences of each word in a file.


In [ ]:
# your code here

Exercise 4:

The module os implements functions that allow us to work with the operating system (see folder contents, change directory, etc.). Use the function listdir from the module os to see the contents of the current directory. Then print all the items that do not start with a dot.


In [ ]:
# your code here